A storage system may include one or more storage servers, which may include one or more storage appliances. A storage server may provide services related to the organization of data on storage devices, such as disks. Some of these storage servers are commonly referred to as filers or file servers. An example of such a storage server is any of the Filer products made by Network Appliance, Inc. in Sunnyvale, Calif. The storage server may be implemented with a special-purpose computer or a general-purpose computer. Depending on the application, various storage systems may include different numbers of storage servers.
In a storage system, there may be one or more Redundant Array of Independent Disks (RAID) subsystems. To improve the performance of the disks in a RAID subsystem, preventive maintenance work is performed on the disks periodically. For example, a disk may be periodically scanned for errors, such as media errors or hardware errors.
Furthermore, if a media error is found in a sector of a disk, one technique merely reassigns data from the defective sector to another sector on the disk. However, simply reassigning the data from the defective sector may not allow the disk to return to error-free operation. For instance, a disk that has experienced a predetermined number of errors of a particular type may need to be physically removed from the system and returned to the vendor for major repair. FIG. 1A shows an existing service route map for an exemplary storage system deployed at a customer site.
Referring to FIG. 1A, the route may include four stops, which may be located at different sites. The four stops in FIG. 1A include the customer site 101, a return merchandise site 103, a customer service depot 105, and a site of a storage device vendor 107. The exemplary storage system, which includes a number of storage devices (e.g., disks), is deployed at the customer site 101. When the customer reports detection of media errors on a storage device, which may be referred to as a failed storage device, the failed storage device is physically decoupled from the storage system and shipped to the return merchandise site 103.
At the return merchandise site 103, the failed storage device is tested again to confirm that one or more media errors exist on the failed storage device. If the storage device fails the test at the return merchandise site 103 again, the media error is confirmed and the storage device is shipped to the vendor of the storage device at the site 107 for repair. Otherwise, the storage device is passed and shipped to the customer service depot 105, which is typically at a different location from the return merchandise site 103. The storage devices shipped to the customer service depot 105 may be shipped back to the customer site 101 to be re-coupled to the storage system at the customer site 101. Alternatively, the storage devices may be shipped from the customer service depot 105 to other customers' sites to be integrated into the storage systems at those sites.
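The routing decision made at the return merchandise site 103 can be illustrated with a minimal Python sketch. The names `Stop` and `next_stop_after_retest` are hypothetical and introduced only for illustration; the enumeration values simply reuse the reference numerals of FIG. 1A, and the sketch is not part of any existing system.

```python
from enum import Enum


class Stop(Enum):
    """Stops on the service route of FIG. 1A (values reuse the figure's numerals)."""
    CUSTOMER_SITE = 101
    RETURN_MERCHANDISE_SITE = 103
    CUSTOMER_SERVICE_DEPOT = 105
    VENDOR_SITE = 107


def next_stop_after_retest(media_error_confirmed: bool) -> Stop:
    """Return where a device goes after the retest at the return merchandise site.

    Hypothetical helper: a confirmed media error sends the device to the
    vendor for repair; otherwise the device is passed and shipped to the
    customer service depot.
    """
    if media_error_confirmed:
        return Stop.VENDOR_SITE
    return Stop.CUSTOMER_SERVICE_DEPOT
```

Under these assumptions, `next_stop_after_retest(False)` yields `Stop.CUSTOMER_SERVICE_DEPOT`, from which the device may return to the customer site 101 or be sent to another customer's site.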
The percentage of storage devices passed at the return merchandise site 103 out of the failed storage devices shipped to the return merchandise site 103 may be referred to as the Not-To-Fail (NTF) rate. In one existing system, the NTF rate can reach approximately 50%. The higher the NTF rate, the higher the cost of servicing the storage system at the customer site 101, because more functioning storage devices are shipped unnecessarily from the customer site 101 to the return merchandise site 103 and the customer service depot 105. Besides the cost of shipping the storage devices, another problem is that physically moving the storage devices increases the risk of mechanical damage to the storage devices moved.
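As a worked illustration of the NTF rate defined above, the following Python sketch computes the percentage from two counts. The function name `ntf_rate` and its parameter names are hypothetical, chosen only for this example.

```python
def ntf_rate(devices_passed: int, devices_shipped: int) -> float:
    """Return the NTF rate as a percentage.

    devices_passed:  devices that passed the retest at the return merchandise site.
    devices_shipped: failed devices shipped to the return merchandise site.
    Both names are assumptions made for this sketch.
    """
    if devices_shipped <= 0:
        raise ValueError("at least one device must have been shipped")
    if not 0 <= devices_passed <= devices_shipped:
        raise ValueError("devices_passed must be between 0 and devices_shipped")
    return 100.0 * devices_passed / devices_shipped
```

For example, if 100 failed storage devices are shipped and 50 of them pass the retest, `ntf_rate(50, 100)` gives 50.0, which corresponds to the approximately 50% figure mentioned above.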
In addition to requiring error scanning, some storage devices today use firmware, which may be upgraded from time to time. However, it is difficult to upgrade the firmware of a storage device when the storage device is a member of a redundancy group (e.g., a RAID group), because downloading the firmware to the storage device would require a service outage in the redundancy group.